Hive Metastore: Merge hive and avro schema if not consistent #55
funcheetah wants to merge 3 commits into linkedin:master from funcheetah:master
Conversation
    cols.addAll(getPartitionCols(table));
    }
    return convertFieldSchemaToAvroSchema(recordName, recordNamespace, true, cols);
Instead of passing cols, which is a List, can we create a struct field schema here and call HiveTypeToAvroType directly? That would let us get rid of convertFieldSchemaToAvroSchema.
I see this is a tail call, so we can just move the code from convertFieldSchemaToAvroSchema into convertHiveSchemaToAvro. Aside from that, what do you mean by "create a struct field schema"? I see the parseSchemaFromStruct method in HiveTypeToAvroType, but it is private; are you referring to that method?
    for (int i = 0; i < fieldNames.size(); ++i) {
      final TypeInfo fieldTypeInfo = fieldTypeInfos.get(i);
      String fieldName = fieldNames.get(i);
      fieldName = removePrefix(fieldName);
Do we need this? I think the field names being passed here are relative; they come from StructTypeInfo.getAllStructFieldNames(), so I don't think they are qualified from the root. A `.` in a field name here is probably part of the actual name of the field.
Yeah, I think it can be removed. Also, since we are dealing with Hive, I expect the field names won't contain `.` anyway.
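To illustrate the concern, here is a hypothetical sketch of what a prefix-stripping helper like removePrefix might do (this is an assumed reconstruction, not the PR's actual code). On the relative names returned by StructTypeInfo.getAllStructFieldNames() it is a no-op, while a name that genuinely contains a `.` would be mangled:

```java
// Hypothetical reconstruction of removePrefix: strip a leading "parent."
// qualifier from a field name. Since Hive struct field names are already
// simple (relative) names, this is at best a no-op and at worst lossy.
public class RemovePrefixSketch {
  public static String removePrefix(String fieldName) {
    int lastDot = fieldName.lastIndexOf('.');
    return lastDot < 0 ? fieldName : fieldName.substring(lastDot + 1);
  }

  public static void main(String[] args) {
    System.out.println(removePrefix("lastname")); // relative name: unchanged
    System.out.println(removePrefix("a.b"));      // dotted name: mangled to "b"
  }
}
```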
    // We don't cache the structType because otherwise it could be possible that a field
    // "lastname" is of type "firstname", where firstname is a compiled class.
    // This will lead to ambiguity.
I am not sure what this comment means. Which cache are we referring to?
    if (schemaStr != null) {
      schema = AvroSchemaUtil.toIceberg(new org.apache.avro.Schema.Parser().parse(schemaStr));
      org.apache.avro.Schema avroSchema = new org.apache.avro.Schema.Parser().parse(schemaStr);
      org.apache.avro.Schema hiveSchema = LegacyHiveSchemaUtils.convertHiveSchemaToAvro(table);
At this step during the conversion we pass mkFieldsOptional as true to make fields nullable, but in the very next line we remove nullability from the schema. Can we just mark fields as non-nullable to begin with and remove LegacyHiveSchemaUtils.extractActualTypeIfFieldIsNullableTypeRecord?
Good catch. I can change the signature to `convertHiveSchemaToAvro(Table table, boolean mkFieldsOptional)`, so that the function directly returns a non-nullable version of the schema.
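A minimal sketch of the proposed flag, with Avro types modeled as JSON type strings so the example stays self-contained (convertColumn is a hypothetical stand-in for the real per-column conversion, which would operate on org.apache.avro.Schema):

```java
public class OptionalFieldsSketch {
  // Hypothetical per-column conversion. With mkFieldsOptional=false the
  // result is already non-nullable, so no separate
  // extractActualTypeIfFieldIsNullableTypeRecord pass is needed afterwards.
  public static String convertColumn(String avroType, boolean mkFieldsOptional) {
    return mkFieldsOptional
        ? "[\"null\",\"" + avroType + "\"]" // nullable union, in Avro JSON form
        : "\"" + avroType + "\"";           // plain non-nullable type
  }
}
```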
    import org.slf4j.LoggerFactory;

    public class LegacyHiveSchemaUtils {
The code in this class is far too verbose. It could be heavily simplified by using visitors.
I feel the same, but that would require a non-trivial refactor of the code. We also want to publish this code soon, so there is a trade-off.
    import org.codehaus.jackson.node.JsonNodeFactory;

    public class HiveTypeToAvroType {
Will do later, but the integration test I ran already passed on all the tables.
    org.apache.avro.Schema tableSchema = avroSchema;
    boolean isHiveSchemaEvolved =
        LegacyHiveSchemaUtils.isRecordSchemaEvolved(avroSchemaWithoutNullable, hiveSchemaWithoutNullable);
It seems isRecordSchemaEvolved has to traverse the whole schema tree. What do we gain by checking this first rather than just merging directly?
That is actually a good point. I think the two passes can be combined into one.
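A sketch of folding the isRecordSchemaEvolved check into the merge itself, so the schema tree is traversed once. Record schemas are modeled here as ordered name-to-type maps to keep the example self-contained; the real code would walk org.apache.avro.Schema objects the same way:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class MergeInOnePassSketch {
  public static final class Result {
    public final Map<String, String> merged;
    public final boolean evolved; // true iff hive contributed any new field
    Result(Map<String, String> merged, boolean evolved) {
      this.merged = merged;
      this.evolved = evolved;
    }
  }

  // Merge hive fields into the avro schema while tracking, in the same pass,
  // whether the hive schema has evolved relative to the avro one.
  public static Result merge(Map<String, String> avro, Map<String, String> hive) {
    Map<String, String> merged = new LinkedHashMap<>(avro);
    boolean evolved = false;
    for (Map.Entry<String, String> e : hive.entrySet()) {
      if (!merged.containsKey(e.getKey())) { // hive-only field: schema evolved
        merged.put(e.getKey(), e.getValue());
        evolved = true;
      }
    }
    return new Result(merged, evolved);
  }
}
```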
wmoustafa left a comment
Still going through the patch. Currently there are many ways the conversions take place, and it seems they could be simplified. For example, we could go from the Hive type string to a Hive TypeInfo (one TypeInfo representing the whole schema) and then to an Avro schema.
    for (int i = 0; i < fieldNames.size(); ++i) {
      final TypeInfo fieldTypeInfo = fieldTypeInfos.get(i);
      String fieldName = fieldNames.get(i);
      fieldName = removePrefix(fieldName);

      schema = parseSchemaFromUnion((UnionTypeInfo) typeInfo, recordNamespace, recordName);
      break;
    default:
      throw new UnsupportedOperationException("Conversion from " + category + " not supported");
    // For example, in tracking.CommunicationRequestEvent.specificRequest,
    // PropGenerated and PropExternalCommunication have the same structure. In case of duplicate typeinfos, we generate
Best not to mention actual table and field names.
    final List<FieldSchema> cols = new ArrayList<>();
    cols.addAll(table.getSd().getCols());
There were concerns around using getSd().getCols(). Could you check whether we should use HiveMetastoreClient.getSchema() instead?
    for (TypeInfo ti : typeInfos) {
      Schema candidate;
      if (ti instanceof StructTypeInfo) {
        StructTypeInfo sti = (StructTypeInfo) ti;

    // a new record type for the duplicates.
    List<Schema> schemas = new ArrayList<>();
    for (TypeInfo ti : typeInfos) {
    private static final String SHORT_TYPE_NAME = "short";
    private static final String BYTE_TYPE_NAME = "byte";

    public HiveTypeToAvroType(String namespace, boolean mkFieldsOptional) {
Does it make sense to convert this to a utility class and move those parameters to convertFieldsTypeInfoToAvroSchema?
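A sketch of the suggested shape. The method name follows the PR, but the body here is a placeholder stub, not the real conversion:

```java
// Stateless utility version of HiveTypeToAvroType: namespace and
// mkFieldsOptional become method parameters instead of instance state.
public final class HiveTypeToAvroTypeSketch {
  private HiveTypeToAvroTypeSketch() { } // no instances

  // Placeholder return type and body; the real method would build an
  // org.apache.avro.Schema from the field TypeInfos.
  public static String convertFieldsTypeInfoToAvroSchema(
      String namespace, String recordName, boolean mkFieldsOptional) {
    return namespace + "." + recordName + (mkFieldsOptional ? " [optional fields]" : "");
  }
}
```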
    Schema convertFieldsTypeInfoToAvroSchema(String recordNamespace, String recordName, List<String> fieldNames,
        List<TypeInfo> fieldTypeInfos) {
Can we avoid using parallel lists of field names and field types throughout the PR? For example, instead of List<String> fieldNames and List<TypeInfo> fieldTypeInfos, we could just pass a StructTypeInfo. Usually the input is already a StructTypeInfo, which is then broken down into two lists and handled here. In cases where the original input comes as two lists, we can combine them using TypeInfoFactory.getStructTypeInfo() from Hive.
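A minimal stand-in illustrating the point. Hive's StructTypeInfo and TypeInfoFactory.getStructTypeInfo() play this role in practice; this sketch just shows why a single struct object is safer than two independently passed lists:

```java
import java.util.List;

// A struct value object keeps field names and field types aligned by
// construction, which two separately-passed parallel lists cannot guarantee.
public class StructSketch {
  public final List<String> fieldNames;
  public final List<String> fieldTypes;

  public StructSketch(List<String> fieldNames, List<String> fieldTypes) {
    if (fieldNames.size() != fieldTypes.size()) {
      throw new IllegalArgumentException("field names and types must align");
    }
    this.fieldNames = fieldNames;
    this.fieldTypes = fieldTypes;
  }
}
```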
    // We don't cache the structType because otherwise it could be possible that a field
    // "lastname" is of type "firstname", where firstname is a compiled class.
    // This will lead to ambiguity.
    schema = parseSchemaFromStruct((StructTypeInfo) typeInfo, recordNamespace, recordName);
Can we rename this and other methods to something like convertStructTypeInfoToAvroSchema?
No description provided.